Designing spelling correctors for inflected languages using lexical transducers
Abstract
This paper describes the components used in the design of the commercial Xuxen II spelling checker/corrector for Basque. It is a new version of the Xuxen spelling corrector (Aduriz et al., 97) that uses lexical transducers to improve the process. An important new feature is the use of user dictionaries whose entries can recognise both the original and the inflected forms. In highly inflected languages such as Basque, spell checking cannot be done adequately without morphological treatment of words. In addition, the morphological treatment has other important features: coverage, reusability of tools, orthogonality and security. The tool is based on lexical transducers and is built using the fst library of Inxight. A lexical transducer (Karttunen, 94) is a finite-state automaton that maps inflected surface forms to lexical forms. It can be seen as an evolution of two-level morphology (Koskenniemi, 83) in which the use of diacritics and homographs can be avoided and the intersection and composition of transducers is possible. In addition, the process is very fast, and the transducer for the whole morphological description can be compacted into less than 1 MB. The design of the spelling corrector consists of four main modules:
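To make the idea of a lexical transducer concrete, here is a minimal Python sketch. It is not the Xuxen II implementation and does not use the Inxight fst library. The class `ToyTransducer`, the helper `add_lemma`, the suffix table and the tags `+Bare`, `+Abs.Sg` and `+Abs.Pl` are all hypothetical names chosen for illustration, and the sketch ignores Basque morphophonology entirely. It only shows how a surface form can be mapped to a lexical form, and how a user-dictionary lemma can be made to accept its inflected forms as well.

```python
from collections import defaultdict

EPS = ""  # empty surface symbol: the arc consumes no input character


class ToyTransducer:
    """Toy acyclic transducer mapping surface strings to lexical strings."""

    def __init__(self):
        # state -> list of (surface_symbol, lexical_output, next_state)
        self.arcs = defaultdict(list)
        self.finals = set()
        self._next_state = 1  # state 0 is the start state

    def add_path(self, pairs):
        """Add a path of (surface, lexical) pairs, sharing existing prefixes."""
        state = 0
        for surf, lex in pairs:
            for s, l, nxt in self.arcs[state]:
                if s == surf and l == lex:
                    state = nxt
                    break
            else:
                new_state = self._next_state
                self._next_state += 1
                self.arcs[state].append((surf, lex, new_state))
                state = new_state
        self.finals.add(state)

    def analyze(self, word):
        """Return every lexical form whose surface side spells `word`."""
        results = set()
        stack = [(0, 0, "")]  # (state, position in word, output so far)
        while stack:
            state, pos, out = stack.pop()
            if pos == len(word) and state in self.finals:
                results.add(out)
            for surf, lex, nxt in self.arcs[state]:
                if surf == EPS:
                    stack.append((nxt, pos, out + lex))
                elif pos < len(word) and word[pos] == surf:
                    stack.append((nxt, pos + 1, out + lex))
        return results


# Hypothetical suffix table: surface suffix -> illustrative lexical tag.
SUFFIXES = {"": "+Bare", "a": "+Abs.Sg", "ak": "+Abs.Pl"}


def add_lemma(fst, stem):
    """Register a lemma so that its inflected forms are accepted too."""
    for suffix, tag in SUFFIXES.items():
        pairs = [(ch, ch) for ch in stem]        # stem is copied unchanged
        pairs.append((EPS, tag))                 # emit the tag, consume nothing
        pairs.extend((ch, "") for ch in suffix)  # consume suffix, emit nothing
        fst.add_path(pairs)


fst = ToyTransducer()
add_lemma(fst, "etxe")  # system lexicon entry
add_lemma(fst, "blog")  # user-dictionary entry, added at run time

print(fst.analyze("etxeak"))
print(fst.analyze("bloga"))
print(fst.analyze("etxx"))
```

A word counts as correctly spelled in this sketch when `analyze` returns a non-empty set. The user-dictionary entry `blog` needs no special handling: it is added through the same suffix table as any other lemma, so its inflected forms are accepted as well.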
Similar resources
A Two-level Morphological Analyser and Generator for Irish using Finite-State Transducers
Computational morphology is an important part of natural language processing. Finite-state techniques have been applied successfully in computational phonology and morphology to many of the world’s major languages. Celtic languages such as Modern Irish present challenging morphological features that to date have not been addressed using finite-state technology. This paper presents a finite-stat...
Lexical Analysis of Agglutinative Languages Using a Dictionary of Lemmas and Lexical Transducers
This paper presents a simple method for performing a lexical analysis of agglutinative languages like Korean, which have a heavy morphology. Especially, for nouns and adverbs with regular morphological modifications and/or high productivity, we do not need to artificially construct huge dictionaries of all inflected forms of lemmas. To construct a dictionary of lemmas and lexical transducers, f...
Using foma for language-based games
This paper describes two examples of how finite-state technology (FST), commonly used in computational morphology, can help implement language-based games. The tool we have used is foma, an open-source toolkit similar to previous Xerox/PARC finite-state tools. FST tools have been widely used to describe the morphology of languages and to implement spelling checkers and correctors, especially for ...
From Lexical Acquisition to Lexical Reusable Tools
Having as background the work in the definition and implementation of a system for the acquisition and management of reusable morphological and phrasal dictionaries, and the realization of a framework for the generation of different finite-state tools for an efficient and distributed use of the different functionalities defined in the system, we will present the overall system and focus the att...
Design and implementation of Persian spelling detection and correction system based on Semantic
Persian Language has a special feature (grapheme, homophone, and multi-shape clinging characters) in electronic devices. Furthermore, design and implementation of NLP tools for Persian are more challenging than other languages (e.g. English or German). Spelling tools are used widely for editing user texts like emails and text in editors. Also developing Persian tools will provide Persian progr...
Publication date: 1999